Small tweaks to the devdocs #72
Conversation
Codecov Report

```
@@           Coverage Diff           @@
##           master      #72   +/-   ##
=======================================
  Coverage   92.91%   92.91%
=======================================
  Files          26       26
  Lines        2836     2836
=======================================
  Hits         2635     2635
  Misses        201      201
```

Continue to review full report at Codecov.
I updated the Manifest file to Documenter 0.24.5, and added a little more to the devdocs. You can also estimate reciprocal throughput and latency, e.g.:

```julia
using CpuId, VectorizationBase, SIMDPirates, SLEEFPirates, VectorizedRNG

@generated function estimate_cost_onearg(f::F, N::Int = 512, K = 1_000, ::Type{T} = Float64, ::Val{U} = Val(4)) where {F,T,U}
    W, Wshift = VectorizationBase.pick_vector_width_shift(T)
    quote
        Base.Cartesian.@nexprs $U u -> s_u = vbroadcast(Vec{$W,$T}, zero(T))
        # s = vbroadcast(V, zero(T))
        x = rand(T, N << $Wshift)
        ptrx = pointer(x)
        ts_start, id_start = cpucycle_id()
        for k ∈ 1:K
            _ptrx = ptrx
            for n ∈ 1:N >> $(VectorizationBase.intlog2(U))
                Base.Cartesian.@nexprs $U u -> begin
                    v_u = vload(Vec{$W,$T}, _ptrx)
                    s_u = vadd(s_u, f(v_u))
                    _ptrx += VectorizationBase.REGISTER_SIZE
                end
            end
        end
        ts_end, id_end = cpucycle_id()
        @assert id_start == id_end
        Base.Cartesian.@nexprs $(U-1) u -> s_1 = vadd(s_1, s_{u+1})
        (ts_end - ts_start) / (N*K), vsum(s_1)
    end
end
```

I'm sure this could be improved. It adds an extra add and load instruction, but my concern was that LLVM may optimize things away.

```julia
julia> first(estimate_cost_onearg(SLEEFPirates.log, 512, 10_000, Float64, Val(1))) # 51 cycles # 44
13.4911943359375

julia> first(estimate_cost_onearg(SLEEFPirates.log, 512, 10_000, Float64, Val(2))) # 51 cycles # 40
13.177637109375

julia> first(estimate_cost_onearg(SLEEFPirates.log, 512, 10_000, Float64, Val(4))) # 51 cycles # 39
13.1288251953125

julia> first(estimate_cost_onearg(SLEEFPirates.exp, 512, 10_000, Float64, Val(1))) # 51 cycles # 44
14.2456966796875

julia> first(estimate_cost_onearg(SLEEFPirates.exp, 512, 10_000, Float64, Val(2))) # 51 cycles # 40
14.753721484375

julia> first(estimate_cost_onearg(SLEEFPirates.exp, 512, 10_000, Float64, Val(4))) # 51 cycles # 39
13.128287109375
```

Let me know if you think it's ready and I'll merge it.
Most of them came from there (specifically, Skylake-X), so I've now mentioned that. I used the approach with
The unit is "number of floating-point registers". AVX-512 systems have 32, and other x86-64 CPUs have 16. It is set to 0 for most instructions, under the heuristic assumption that they won't net-consume any extra registers. I think that is unlikely to be wrong in practice; generally at least one of the arguments won't be used anymore, so the register it occupied becomes available again. The primary exceptions are
The register pressure comes into play when solving for tile size. That is, if it is considering unrolling 2 loops, it solves the constrained optimization problem of minimizing cost without consuming any more registers than the CPU cores have available.
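To make the register-pressure constraint concrete, here is a toy brute-force version of that constrained optimization. Everything here is illustrative: the cost model and the pressure formula are made up for the sketch, and this is not LoopVectorization's actual solver.

```julia
# Toy register-constrained tile-size search (illustrative only; not
# LoopVectorization's actual solver). Assumptions: a (u1, u2) tile keeps
# u1*u2 accumulators plus u1 + u2 loaded vectors live, and a larger tile
# amortizes fixed loop overhead over more iterations.
function best_tile(nregisters::Int; maxu = 8)
    best = (1, 1)
    bestcost = Inf
    for u1 in 1:maxu, u2 in 1:maxu
        # register pressure: accumulators + operands must fit
        u1 * u2 + u1 + u2 <= nregisters || continue
        # toy cost: one fma per iteration, plus fixed overhead
        # split across the u1*u2 iterations of the tile
        cost = 1.0 + 4.0 / (u1 * u2)
        if cost < bestcost
            bestcost = cost
            best = (u1, u2)
        end
    end
    best
end

best_tile(16)  # 16 registers, as on AVX2
best_tile(32)  # 32 registers, as on AVX-512, permit a larger tile
```

The point of the sketch is only that the feasible tile grows with the register count, so AVX-512's 32 registers admit more aggressive unrolling than the 16 available elsewhere.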
Looking at the instruction tables you linked, on page 271 (Skylake-X), reciprocal throughput: divsd: 4, sqrtsd: 4-6.

Most instructions fell into one of two categories: either their cost was independent of the length of the vectors*, or it increased in direct proportion to vector length. I could probably represent this with bigger tables, but for now it's mostly using sentinel values to indicate how the instruction cost will change as a function of vector width.

*If you weren't running any vectorized instructions, clock frequency could increase, but

This is more about which you want to hoist out of the loop: the squaring, or the inversion. Given fast-math flags, LLVM will choose "neither" (it'll replace the single inversion followed by a multiplication on each iteration with repeated divisions).

```julia
julia> using LoopVectorization, BenchmarkTools

julia> function contrived_example1(x, y)
           s = zero(promote_type(eltype(x), eltype(y)))
           @inbounds for i in eachindex(x); @simd for j in eachindex(y)
               s += inv(x[i]) * abs2(y[j])
           end; end
           s
       end
contrived_example1 (generic function with 1 method)

julia> function contrived_example2(x, y)
           s = zero(promote_type(eltype(x), eltype(y)))
           @inbounds for j in eachindex(y); @simd for i in eachindex(x)
               s += inv(x[i]) * abs2(y[j])
           end; end
           s
       end
contrived_example2 (generic function with 1 method)

julia> function contrived_example_avx1(x, y)
           s = zero(promote_type(eltype(x), eltype(y)))
           @avx for i in eachindex(x), j in eachindex(y)
               s += inv(x[i]) * abs2(y[j])
           end
           s
       end
contrived_example_avx1 (generic function with 1 method)

julia> function contrived_example_avx2(x, y)
           s = zero(promote_type(eltype(x), eltype(y)))
           @avx for j in eachindex(y), i in eachindex(x)
               s += inv(x[i]) * abs2(y[j])
           end
           s
       end
contrived_example_avx2 (generic function with 1 method)

julia> x = rand(200); y = rand(200);

julia> @btime contrived_example1($x, $y)
  4.563 μs (0 allocations: 0 bytes)
101147.92090855418

julia> @btime contrived_example2($x, $y)
  20.870 μs (0 allocations: 0 bytes)
101147.9209085543

julia> @btime contrived_example_avx1($x, $y)
  2.425 μs (0 allocations: 0 bytes)
101147.92090855431

julia> @btime contrived_example_avx2($x, $y)
  2.534 μs (0 allocations: 0 bytes)
101147.92090855431
```

It would be nice if I could rely on LLVM for some of these optimization decisions, but it seems easier to find cases where fast-math flags cause regressions than where they help, once I've already searched the expression to substitute
I approve your changes. Both the changes to the docs and your replies are very informative as always! Really enjoying learning more about CPUs from an obvious master!
It occurs to me that one option for the future might be a build step in which we measure the costs on the specific machine on which this is being built.
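Such a build step might look roughly like the following. This is only a sketch of the idea: the timing helper, the `measured_costs.jl` file name, and the operations timed are all invented here, not taken from the package.

```julia
# Hypothetical build-time cost measurement (sketch only; the helper and
# file name are invented for illustration). Times a few scalar kernels
# on the host machine so the estimates could be persisted for later use.
function measure_cost(f; evals = 10_000)
    xs = 1.0 .+ rand(evals) .* 1e-3   # varied inputs so the call isn't hoisted
    f(xs[1])                          # warm up / force compilation
    t = time_ns()
    s = 0.0
    @inbounds for x in xs
        s += f(x)
    end
    elapsed = (time_ns() - t) / evals
    isnan(s) && error("unreachable")  # use `s` so the loop isn't elided
    elapsed                           # rough ns per call, incl. loop overhead
end

costs = Dict(
    "exp"  => measure_cost(exp),
    "log"  => measure_cost(log),
    "sqrt" => measure_cost(sqrt),
)

# A real build step could then write these next to the package, e.g.:
# open(joinpath(@__DIR__, "measured_costs.jl"), "w") do io
#     println(io, "const MEASURED_COSTS = ", repr(costs))
# end
```

A cycle-accurate version would presumably reuse the `cpucycle_id`-based approach from the comment above rather than wall-clock timing.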
Thanks -- I'm flattered. You're obviously a Julia master, with an impressive array of popular packages used by many, if not most, in the community.
I think this could be done fairly quickly, so it'd be pretty reasonable.
The new devdocs are wonderful! They are really helping me get a better handle on the internals.
These are a few tweaks that helped clarify my thinking, but different folks might think differently. Feel free to close this if you think it's a step backwards.
The change to the `prettyurls` setting makes it easier to browse the docs in a local build. I also added `[compat]` bounds on Documenter because I've had my docs break when new releases are made.

One thing I was intrigued by is where the costs come from. I tried to put in one reference, but if you prefer another one, by all means use it instead. And I'm pretty vague on what the register pressure is actually measuring (what are its units?), and didn't even try on the `scaling` parameter.